Simple version of patterned_text with a single doc value for arguments #129292

parkertimmins · 2025-06-11T19:27:45Z

Initial version of the patterned_text mapper. This will behave similarly to match_only_text. This version uses a single SortedSetDocValues for a template and another for arguments. It splits the message by delimiters, then classifies a token as an argument if it contains a digit. All arguments are concatenated and inserted as a single doc value. A single inverted index is used, without positions. Phrase queries are still possible, using the SourceConfirmedTextQuery, but are not fast.
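A minimal sketch of the split described above, not the mapper's actual code. The delimiter set, the `%A` placeholder, the space separator for concatenated arguments, and all class and method names here are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names, delimiters and placeholder are hypothetical.
class PatternedTextSketch {
    // Hypothetical marker left in the template where an argument was removed.
    static final String PLACEHOLDER = "%A";
    // Assumed delimiter set; the real mapper may use a different one.
    static final String DELIMITERS = " \t\n,:;=";

    static final class Parts {
        private final String template;
        private final List<String> args;

        Parts(String template, List<String> args) {
            this.template = template;
            this.args = args;
        }

        String template() { return template; }

        List<String> args() { return args; }
    }

    // Split on delimiters; a token containing a digit becomes an argument.
    static Parts split(String message) {
        StringBuilder template = new StringBuilder();
        List<String> args = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= message.length(); i++) {
            boolean atEnd = i == message.length();
            if (atEnd || DELIMITERS.indexOf(message.charAt(i)) >= 0) {
                String token = message.substring(start, i);
                if (containsDigit(token)) {
                    args.add(token);
                    template.append(PLACEHOLDER);
                } else {
                    template.append(token);
                }
                if (atEnd == false) {
                    template.append(message.charAt(i)); // delimiters stay in the template
                }
                start = i + 1;
            }
        }
        return new Parts(template.toString(), args);
    }

    static boolean containsDigit(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.isDigit(token.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // All arguments are concatenated into one value (assumed space-separated).
    static String encodeArgs(List<String> args) {
        return String.join(" ", args);
    }
}
```

With these assumptions, `split("connected to 10.0.0.5 port 9200 ok")` yields the template `connected to %A port %A ok` and the arguments `10.0.0.5` and `9200`, which `encodeArgs` joins into the single value `10.0.0.5 9200`.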

Part of #128932

Keep tests but treat all arguments as text args

martijnvg

Looks good @parkertimmins! I did a first pass. Just storing all arguments in one sorted set doc values field is a good first approach.

...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java

modules/mapper-extras/src/main/java/module-info.java

...ed-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextIndexFieldData.java

.../main/java/org/elasticsearch/xpack/patternedtext/PatternedTextSyntheticFieldLoaderLayer.java

...ed-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextValueProcessor.java

...rned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextMapperPlugin.java

PatternedDocValues implemented SortedSetDocValues, but this does not work since there is no ordinal for the full patterned text. Instead use binary doc values which don't use a term dictionary

martijnvg · 2025-06-16T12:32:11Z

...tterned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextDocValues.java

+
+public class PatternedTextDocValues extends SortedSetDocValues {
+    private final SortedSetDocValues templateDocValues;
+    private final SortedSetDocValues argsDocValues;


I think this can be SortedDocValues, since there is only one args value per document? (all values are concatenated?)

Agreed, there should only be one template and one concatenated args per doc

...erned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextFieldMapper.java

x-pack/plugin/mapper-patterned-text/src/yamlRestTest/resources/rest-api-spec/test/10_basic.yml

martijnvg

I left a few more comments.

...tterned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextFieldType.java

modules/mapper-extras/src/main/java/module-info.java

...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java

martijnvg · 2025-06-19T07:47:18Z

...erned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextFieldMapper.java

+        // Add args doc_values
+        if (parts.args().isEmpty() == false) {
+            String remainingArgs = PatternedTextValueProcessor.encodeRemainingArgs(parts);
+            context.doc().add(new SortedSetDocValuesField(fieldType().argsFieldName(), new BytesRef(remainingArgs)));


As in an earlier comment, both SortedSetDocValuesField usages here can be replaced with SortedDocValuesField.

x-pack/plugin/mapper-patterned-text/build.gradle

test/framework/src/main/java/org/elasticsearch/index/mapper/FieldTypeTestCase.java

x-pack/plugin/mapper-patterned-text/src/yamlRestTest/resources/rest-api-spec/test/30_sort.yml

...ed-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextIndexFieldData.java

...tterned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextFieldType.java

parkertimmins · 2025-06-20T23:02:00Z

...ed-text/src/test/java/org/elasticsearch/xpack/patternedtext/PatternedTextFieldTypeTests.java

+        assertEquals(new ConstantScoreQuery(new TermQuery(new Term("field", "foo"))), ft.termQuery("foo", null));
+        assertEquals(AutomatonQueries.caseInsensitiveTermQuery(new Term("field", "fOo")), ft.termQueryCaseInsensitive("fOo", null));
+    }
+


I ended up removing the test testFetchDocValues: cd2f9aa#diff-88edd76ca94733a91a1061049481ab3e6e321b534c4d783d517dbb0186d9c30dL110

I was having trouble accessing the doc values without going to fielddata. But since the message should be accessed through source (presumably synthetic) it doesn't seem the message should be accessible as a doc value.

elasticsearchmachine · 2025-06-24T02:05:09Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine · 2025-06-24T02:05:09Z

Hi @parkertimmins, I've created a changelog YAML for you.

martijnvg

Thanks @parkertimmins, a first step to get this new field type in! LGTM

Also we need to label this PR as non-issue, given that the new field type is gated with a feature flag and will not be available in released versions. The PR that removes the feature flag should have the right labels for the release notes.

martijnvg · 2025-06-24T13:27:00Z

...sdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextFieldMapper.java

+        this.indexAnalyzer = builder.analyzers.getIndexAnalyzer();
+        this.positionIncrementGap = builder.analyzers.positionIncrementGap.getValue();
+    }
+


In a follow-up we should override the iterator() method (from FieldMapper) and expose the template field as a keyword field mapper. This will allow us to make the template field a real sub field of the patterned_text field mapper, which allows it to be sortable, queryable and aggregatable.

...ogsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextDocValues.java

kkrik-es · 2025-06-24T13:55:14Z

...ogsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextFieldType.java

+
+    @Override
+    public ValueFetcher valueFetcher(SearchExecutionContext context, String format) {
+        return SourceValueFetcher.toString(name(), context, format);


Why do we use this here, instead of fetching from doc values?

This is similar to what happens in match only text field mapper. The idea was to make this change as small as possible.

kkrik-es · 2025-06-24T14:02:03Z

...ogsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextFieldType.java

+        return new SourceValueFetcherSortedBinaryIndexFieldData.Builder(
+            name(),
+            CoreValuesSourceType.KEYWORD,
+            SourceValueFetcher.toString(fieldDataContext.sourcePathsLookup().apply(name())),


@martijnvg does this fetch all synthetic source before filtering for this particular field?

This does generate the complete source from all fields, so this is slow. This is similar to the match_only_text field type, whose field data can only be used in script contexts (runtime fields).

For patterned_text, I don't think we should use its field data, but rather rely on the field data of its sub fields (e.g. the template field).

But there's no reason to fall back to the source here, since we have docvalues. We can synthesize the message from them and check - as done in the prototype. Same applies to phrase queries etc.

This is fine as a first iteration, but I think it needs to be addressed shortly.
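A rough sketch of what synthesizing the message from the two doc values could look like, not the prototype's actual code. It assumes the template marks each argument slot with a hypothetical `%A` placeholder and that the concatenated args value is space-separated; the class and method names are made up for illustration.

```java
// Illustrative only: placeholder, separator and names are hypothetical.
class PatternedTextReassembly {
    static final String PLACEHOLDER = "%A";   // assumed argument marker in the template
    static final String ARG_SEPARATOR = " ";  // assumed separator in the concatenated args value

    // Rebuild the original message from one template value and one args value.
    static String reassemble(String template, String encodedArgs) {
        String[] args = encodedArgs.isEmpty() ? new String[0] : encodedArgs.split(ARG_SEPARATOR);
        StringBuilder message = new StringBuilder();
        int from = 0;
        int argIndex = 0;
        int at;
        while ((at = template.indexOf(PLACEHOLDER, from)) >= 0) {
            if (argIndex >= args.length) {
                throw new IllegalStateException("template has more placeholders than stored args");
            }
            // Copy the literal text before the placeholder, then the next argument.
            message.append(template, from, at).append(args[argIndex++]);
            from = at + PLACEHOLDER.length();
        }
        if (argIndex != args.length) {
            throw new IllegalStateException("stored args outnumber template placeholders");
        }
        message.append(template, from, template.length());
        return message.toString();
    }
}
```

Under these assumptions, `reassemble("connected to %A port %A ok", "10.0.0.5 9200")` returns `connected to 10.0.0.5 port 9200 ok`. A real implementation would also need to handle messages that contain the placeholder text literally, which this sketch ignores.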

This logic is now the same as in MatchOnlyTextFieldType. This field data will only be used in the context of runtime fields. Field data usage of the message field directly is rare. In the search api, sorting and aggregations on the field are not allowed, and esql doesn't use field data. This can be improved, however I don't think it is necessary.

In the PoC phase, index sorting was on the message field, but index sorting by the template field is sufficient (or the template id field, when we add that). This is possible when the template field is a real sub field of patterned_text fields.

kkrik-es · 2025-06-24T14:10:43Z

.../src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextIndexFieldData.java

+
+    @Override
+    public SortField sortField(Object missingValue, MultiValueMode sortMode, XFieldComparatorSource.Nested nested, boolean reverse) {
+        throw new IllegalArgumentException("not supported for source patterned text field type");


I assume you're adding this to get you started, but this needs to be addressed asap. There's no point in using this field mapper without sorting..

There's no point in using this field mapper without sorting..

Typically logs are sorted by timestamp? I think we do want to support sorting by the template at some point, hence the comment about overriding the iterator() method. Also, I don't think match_only_text supports sorting in the _search api? (field data is only used in the search api)

It doesn't, but we won't get the storage improvements we measured without it.. I tested with sorting on the message (template) first, then on timestamp.

This can be addressed in a follow-up as well.

Right, so index sorting needs to be configured on the template sub field. This is why I made this comment: #129292 (comment)

This will then allow us to configure index sorting on the message.template and @timestamp fields. That way we get the same savings you observed.

That's possible but I think we'd be exposing too much of the internal implementation. For instance, say that, for some configuration, we also want to sort by some arg fields after the template field (e.g. because we store doubles separately). If we sort directly on the message field, we can apply that implicitly, without changing index settings. Otherwise, it'll be very tricky to apply.

I'm also on the fence about exposing the .template vs .template_id, but that's somewhat orthogonal.

kkrik-es

Well done, Parker. I've a few comments but they can be addressed in follow-up changes.

parkertimmins added 7 commits June 6, 2025 15:53

copy patternedtext directory from prototype

d0c9a2e

Remove timestamp and optimized args

78ef582

copy tests from prototype

6bb95ac

Remove all token type specific test cases

ad728f3

Keep tests but treat all arguments as text args

Fix bug introduced while moving to all shared args

a97bbc4

index original value

3a0e349

Remove type specific handling from tests

cda6b0a

elasticsearchmachine added the v9.1.0 label Jun 11, 2025

elasticsearchmachine and others added 5 commits June 11, 2025 19:34

[CI] Auto commit changes from spotless

30ae746

A few more changes from prototype

387757d

add messages with arguments to yaml tests

03b9fde

More yaml test updates

f9344bf

[CI] Auto commit changes from spotless

e4d4830

martijnvg reviewed Jun 13, 2025

...rned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextMapperPlugin.java Outdated

parkertimmins and others added 4 commits June 13, 2025 15:41

Change to BinaryDocValues

fca0f83

PatternedDocValues implemented SortedSetDocValues, but this does not work since there is no ordinal for the full patterned text. Instead use binary doc values which don't use a term dictionary

[CI] Auto commit changes from spotless

f9b030b

address some pr feedback

4429474

remove unneeded change

10147ad

martijnvg reviewed Jun 16, 2025

parkertimmins commented Jun 18, 2025

...erned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextFieldMapper.java

parkertimmins commented Jun 18, 2025

x-pack/plugin/mapper-patterned-text/src/yamlRestTest/resources/rest-api-spec/test/10_basic.yml Outdated

parkertimmins added 2 commits June 18, 2025 12:08

Merge branch 'main' into parker/pattern-text-initial-version

4a3ba41

Make patterned_text not aggregatable

b7450c2

martijnvg reviewed Jun 19, 2025

parkertimmins added 3 commits June 19, 2025 13:37

apply patch to simplify fielddata

8efac46

Fixes to fielddata and valuefetcher

cd2f9aa

removed wrong yaml test

906ca11

parkertimmins commented Jun 20, 2025

Add patterned_text feature_flag

4b63ed0

parkertimmins marked this pull request as ready for review June 24, 2025 02:02

elasticsearchmachine added the needs:triage label Jun 24, 2025

parkertimmins added the :StorageEngine/Mapping and >feature labels Jun 24, 2025

elasticsearchmachine added the Team:StorageEngine label and removed the needs:triage label Jun 24, 2025

Update docs/changelog/129292.yaml

fa022e6

Merge branch 'main' into parker/pattern-text-initial-version

8cd70fd

martijnvg approved these changes Jun 24, 2025

parkertimmins added >non-issue and removed >feature labels Jun 24, 2025

Delete docs/changelog/129292.yaml

526fef1

kkrik-es reviewed Jun 24, 2025

...ogsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextDocValues.java Outdated

kkrik-es reviewed Jun 24, 2025

parkertimmins and others added 7 commits June 24, 2025 14:08

Throw exception if multiple messages in document

39e9d88

[CI] Auto commit changes from spotless

eb4212f

disable array tests since multiple values disallowed

e026a5c

Add doc values tests

a2bc5fa

[CI] Auto commit changes from spotless

f0da074

add comment

4e0c337

Merge branch 'main' into parker/pattern-text-initial-version

109afd4

kkrik-es approved these changes Jun 25, 2025

parkertimmins merged commit 9aaba25 into elastic:main Jun 26, 2025
33 checks passed

parkertimmins deleted the parker/pattern-text-initial-version branch June 26, 2025 02:31

parkertimmins mentioned this pull request Jul 7, 2025

Decompose patterned text message into doc values #128932

Closed
